Abstract

Processors in large-scale multiprocessors must be able 
to tolerate large communication latencies and synchro- 
nization delays. This paper describes the architecture 
of a rapid-context-switching processor called APRIL 
with support for fine-grain threads and synchroniza- 
tion. APRIL achieves high single-thread performance 
and supports virtual dynamic threads. A commercial 
RISC-based implementation of APRIL and a run-time 
software system that can switch contexts in about 10 
cycles are described. Measurements taken for several parallel
applications on an APRIL simulator show that the
overhead for supporting parallel tasks based on futures 
is reduced by a factor of two over a corresponding im- 
plementation on the Encore Multimax. The scalability 
of a multiprocessor based on APRIL is explored using 
a performance model. We show that the SPARC-based 
implementation of APRIL can achieve close to 80% pro- 
cessor utilization with as few as three resident threads 
per processor in a large-scale cache-based machine with 
an average base network latency of 55 cycles. 

1 Introduction 

The requirements placed on a processor in a large-scale 
multiprocessing environment are different from those in 
a uniprocessing setting. A processor in a parallel ma- 
chine must be able to tolerate high memory latencies 
and handle process synchronization efficiently [2]. This 
need increases as more processors are added to the sys- 
tem. 

Parallel applications impose processing and commu- 
nication bandwidth demands on the parallel machine. 
An efficient and cost-effective machine design achieves a 
balance between the processing power and the commu- 
nication bandwidth provided. An imbalance is created 
when an underutilized processor cannot fully exploit the 
available network bandwidth. When the network has 
bandwidth to spare, low processor utilization can result
from high network latency. An efficient processor
design for multiprocessors provides a means for hiding
latency. When sufficient parallelism exists, a processor 
that rapidly switches to an alternate thread of computa- 
tion during a remote memory request can achieve high 
utilization. 

Processor utilization also diminishes due to synchro- 
nization latency. Spin lock accesses have a low over- 
head of memory requests, but busy-waiting on a syn- 
chronization event wastes processor cycles. Synchro- 
nization mechanisms that avoid busy-waiting through 
process blocking incur a high overhead. 

Full/empty bit synchronization [22] in a rapid context 
switching processor allows efficient fine-grain synchro- 
nization. This scheme associates synchronization infor- 
mation with objects at the granularity of a data word, 
allowing a low-overhead expression of maximum con- 
currency. Because the processor can rapidly switch to 
other threads, wasteful iterations in spin-wait loops are 
interleaved with useful work from other threads. This 
reduces the negative effects of synchronization on pro- 
cessor utilization. 

This paper describes the architecture of APRIL, 
a processor designed for large-scale multiprocessing. 
APRIL builds on previous research on processors for 
parallel architectures such as HEP [22], MASA [8], P- 
RISC [19], [14], [15], and [18]. Most of these processors 
support fine-grain interleaving of instruction streams
from multiple threads, but suffer from poor single- 
thread performance. In the HEP, for example, instruc- 
tions from a single thread can only be executed once 
every 8 cycles. Single-thread performance is important 
for efficiently running sections of applications with low 
parallelism. 

APRIL does not support cycle-by-cycle interleaving 
of threads. To optimize single-thread performance, 
APRIL executes instructions from a given thread until 
it performs a remote memory request or fails in a synchronization
attempt. We show that such coarse-grain
multithreading allows a simple processor design with 
context switch overheads of 4-10 cycles, without significantly
hurting overall system performance (although
the pipeline design is complicated by the need to handle
pipeline dependencies). In APRIL, thread scheduling is 
done in software, and unlimited virtual dynamic threads 
are supported. APRIL supports full/empty bit synchro- 
nization, and provides tag support for futures [9]. In this 
paper the terms process, thread, context, and task are 
used equivalently. 

By taking a systems-level design approach that con- 
siders not only the processor, but also the compiler and 
run-time system, we were able to migrate several non-critical
operations into the software system, greatly simplifying
processor design. APRIL's simplicity allows an
implementation based on minor modifications to an existing
RISC processor design. We describe such an implementation
based on Sun Microsystems' SPARC processor [23].
A compiler for APRIL, a run-time system,
and an APRIL simulator are operational. We present 
simulation results for several parallel applications on 
APRIL's efficiency in handling fine-grain threads and
assess the scalability of multiprocessors based on a 
coarse-grain multithreaded processor using an analyt- 
ical model. Our SPARC-based processor supports four 
hardware contexts and can switch contexts in about 10 
cycles, which yields roughly 80% processor utilization 
in a system with an average base network latency of 55 
cycles. 

The rest of this paper is organized as follows. Sec- 
tion 2 is an overview of our multiprocessor system archi- 
tecture and the programming model. The architecture 
of APRIL is discussed in Section 3, and its instruction 
set is described in Section 4. A SPARC-based imple- 
mentation of APRIL is detailed in Section 5. Section 6 
discusses the implementation and performance of the 
APRIL run-time system. Performance measurements of 
APRIL based on simulations are presented in Section 7. 
We evaluate the scalability of multithreaded processors 
in Section 8. 



2 The ALEWIFE System 

APRIL is the processing element of ALEWIFE, a large- 
scale multiprocessor being designed at MIT. ALEWIFE 
is a cache-coherent machine with distributed, globally- 
shared memory. Cache coherence is maintained using 
a directory-based protocol [5] over a low-dimension di- 
rect network [20]. The directory is distributed with the 
processing nodes. 

2.1 Hardware 

As shown in Figure 1, each ALEWIFE node consists of
a processing element, floating-point unit, cache, main
memory, cache/directory controller and a network routing
switch. Multiple nodes are connected via a direct,
packet-switched network.

Figure 1: ALEWIFE node.

The controller synthesizes a global shared memory 
space via messages to other nodes, and satisfies requests 
from other nodes directed to its local memory. It main- 
tains strong cache coherence [7] for memory accesses. 
On exception conditions, such as cache misses and failed 
synchronization attempts, the controller can choose to 
trap the processor or to make the processor wait. A 
multithreaded processor reduces the ill effects of the 
long-latency acknowledgment messages resulting from 
a strong cache coherence protocol. To allow experimen- 
tation with other programming models, the controller 
provides special mechanisms for bypassing the coher- 
ence protocol and facilities for preemptive interproces- 
sor interrupts and block transfers. 

The ALEWIFE system uses a low-dimension direct 
network. Such networks scale easily and maintain high 
nearest-neighbor bandwidth. However, the longer ex- 
pected latencies of low-dimension direct networks com- 
pared to indirect multistage networks increase the need 
for processors that can tolerate long latencies. Further- 
more, the lower bandwidth of direct networks over indi- 
rect networks with the same channel width introduces 
interesting design tradeoffs. 

In the ALEWIFE system, a context switch occurs 
whenever the network must be used to satisfy a re- 
quest, or on a failed synchronization attempt. Since 
caches reduce the network request rate, we can em- 
ploy coarse-grain multithreading (context switch ev- 
ery 50-100 cycles) instead of fine-grain multithreading 
(context switch every cycle). This simplifies proces- 
sor design considerably because context switches can be 
more expensive (4 to 10 cycles), and functionality such 
as scheduling can be migrated into run-time software. 
Single-thread performance is optimized, and techniques
used in RISC processors for enhancing pipeline performance
can be applied [10]. Custom design of a processing
element is not required in the ALEWIFE system;
indeed, we are using a modified version of a commercial 
RISC processor for our first-round implementation. 

2.2 Programming Model 

Our experimental programming language for ALEWIFE 
is Mul-T [16], an extended version of Scheme. Mul-T's 
basic mechanism for generating concurrent tasks is the 
future construct. The expression (future X), where
X is an arbitrary expression, creates a task to evaluate
X and also creates an object known as a future to eventually
hold the value of X. When created, the future
is in an unresolved, or undetermined, state. When the 
value of X becomes known, the future resolves to that 
value, effectively mutating into the value of X. Con- 
currency arises because the expression (future X) re- 
turns the future as its value without waiting for the 
future to resolve. Thus, the computation containing 
(future X) can proceed concurrently with the evalu- 
ation of X. All tasks execute in a shared address-space. 

The result of supplying a future as an operand of 
some operation depends on the nature of the operation. 
Non-strict operations, such as passing a parameter to 
a procedure, returning a result from a procedure, as- 
signing a value to a variable, and storing a value into a 
field of a data structure, can treat a future just like any 
other kind of value. Strict operations such as addition 
and comparison, if applied to an unresolved future, are 
suspended until the future resolves and then proceed, 
using the value to which the future resolved as though 
that had been the original operand. 

The act of suspending if an object is an unresolved 
future and then proceeding when the future resolves is 
known as touching the object. The touches that auto- 
matically occur when strict operations are attempted 
are referred to as implicit touches. Mul-T also includes 
an explicit touching or "strict" primitive (touch X) 
that touches the value of the expression X and then 
returns that value. 
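
The distinction between strict and non-strict operations can be illustrated with a minimal sketch. It is written in Python rather than Mul-T, and the names future, touch, strict_add and the worker pool are hypothetical stand-ins chosen for this illustration; it does not reflect how the Mul-T run-time system is implemented.

```python
from concurrent.futures import Future, ThreadPoolExecutor

# Hypothetical worker pool standing in for the machine's processors.
_pool = ThreadPoolExecutor(max_workers=4)


def future(thunk):
    """Analogue of (future X): start evaluating thunk() and return at once."""
    return _pool.submit(thunk)


def touch(value):
    """Analogue of (touch X): wait only if value is an unresolved future."""
    while isinstance(value, Future):
        value = value.result()
    return value


def strict_add(a, b):
    """A strict operation implicitly touches both of its operands."""
    return touch(a) + touch(b)


f = future(lambda: 6 * 7)
cell = [f]                      # non-strict: storing a future needs no touch
total = strict_add(cell[0], 1)  # strict: waits for f to resolve, then adds
```

Storing the future into a data structure proceeds immediately, while the addition suspends until the value is known. In this sketch every strict operation pays for the check in software, which is the overhead that hardware detection of futures is meant to avoid.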

Futures express control-level parallelism. In a large 
class of algorithms, data parallelism is more appropri- 
ate. Barriers are a useful means of synchronization for 
such applications on MIMD machines, but force unnec- 
essary serialization. The same serialization occurs in 
SIMD machines. Implementing data-level parallelism 
in a MIMD machine that allows the expression of maxi- 
mum concurrency requires cheap fine-grain synchroniza- 
tion associated with each data object. We provide this 
support in hardware with full/empty bits. 

We are augmenting Mul-T with constructs for data-level
parallelism and primitives for placement of data
and tasks. As an example, the programmer can use
future-on, which works just like a normal future but
allows the specification of the node on which to schedule 
the future. Extending Mul-T in this way allows us to 
experiment with techniques for enhancing locality and 
to research language-level issues for programming par- 
allel machines. 

3 Processor Architecture 

APRIL is a pipelined RISC processor extended with 
special mechanisms for multiprocessing. This section 
gives an overview of the APRIL architecture and fo- 
cuses on its features that support multithreading, fine- 
grain synchronization, cheap futures, and other models 
of computation. 

The left half of Figure 2 depicts the user-visible processor
state comprising four sets of general purpose registers,
and four sets of Program Counter (PC) chains
and Processor State Registers (PSR). The PC chain 
represents the instruction addresses corresponding to 
a thread, and the PSR holds various pieces of process- 
specific state. Each register set, together with a single 
PC-chain and PSR, is conceptually grouped into a single 
entity called a task frame (using terminology from [8]). 
Only one task frame is active at a given time and is 
designated by a current frame pointer (FP). All reg- 
ister accesses are made to the active register set and 
instructions are fetched using the active PC-chain. Ad- 
ditionally, a set of 8 global registers that are always 
accessible (regardless of the FP) is provided. 

Registers are 32 bits wide. The PSR is also a 32-bit 
register and can be read into and written from the gen- 
eral registers. Special instructions can read and write 
the FP register. The PC-chain includes the Program 
Counter (PC) and next Program Counter (nPC) which 
are not directly accessible. This assumes a single-cycle 
branch delay slot. Condition codes are set as a side 
effect of compute instructions. A longer branch delay 
might be necessary if the branch instruction itself does a 
compare so that condition codes need not be saved [13]; 
in this case the PC chain is correspondingly longer. 
Words in memory have a 32 bit data field, and have 
an additional synchronization bit called the full/empty 
bit. 

Use of multiple register sets on the processor, as in the 
HEP, allows rapid context switching. A context switch 
is achieved by changing the frame pointer and empty- 
ing the pipeline. The cache controller forces a context 
switch on the processor, typically on remote network re- 
quests, and on certain unsuccessful full/empty bit syn- 
chronizations. 

Figure 2: Processor State and Virtual Threads.

APRIL implements futures using the trap mechanism. 
For our proposed experimental implementation based 
on SPARC, which does not have four separate PC and 
PSR frames, context switches are also caused through 
traps. Therefore, a fast trap mechanism is essential. 
When a trap is signalled in APRIL, the trap mechanism 
lets the pipeline empty and passes control to the trap 
handler. The trap handler executes in the same task 
frame as the thread that trapped so that it can access 
all of the thread's registers. 

3.1 Coarse-Grain Multithreading 

In most processor designs to date (e.g. [8, 22, 19, 15]), 
multithreading has involved cycle-by-cycle interleaving 
of threads. Such fine-grain multithreading has been 
used to hide memory latency and also to achieve high 
pipeline utilization. Pipeline dependencies are avoided 
by maintaining instructions from different threads in the 
pipeline, at the price of poor single-thread performance. 
In the ALEWIFE machine, we are primarily con- 
cerned with the large latencies associated with cache 
misses that require a network access. Good single-thread
performance is also important. Therefore
APRIL continues executing a single thread until a mem- 
ory operation involving a remote request (or an unsuc- 
cessful synchronization attempt) is encountered. The 
controller forces the processor to switch to another 
thread, while it services the request. This approach is 
called coarse-grain multithreading. Processors in message
passing multicomputers [21, 27, 6, 4] have traditionally
taken this approach to allow overlapping of
communication with computation.



Context switching in APRIL is achieved by changing 
the frame pointer. Since APRIL has four task frames, 
it can have up to four threads loaded. The thread that 
is being executed resides in the task frame pointed to 
by the FP. A context switch simply involves letting the 
processor pipeline empty while saving the PC-chain and 
then changing the FP to point to another task frame. 
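
The frame-pointer mechanism can be sketched as follows. This is an illustrative Python model only: the class names, the register count per frame, and the round-robin choice of the next frame are assumptions made for the sketch (in APRIL the choice of the next thread is left to software).

```python
from dataclasses import dataclass, field

NUM_FRAMES = 4  # APRIL provides four task frames


@dataclass
class TaskFrame:
    # The register count per frame is illustrative, not taken from the paper.
    registers: list = field(default_factory=lambda: [0] * 32)
    pc: int = 0
    npc: int = 0
    psr: int = 0


class Processor:
    def __init__(self):
        self.frames = [TaskFrame() for _ in range(NUM_FRAMES)]
        self.fp = 0  # frame pointer: index of the active task frame

    @property
    def active(self):
        return self.frames[self.fp]

    def context_switch(self, pc, npc):
        """Save the PC chain of the outgoing thread, then advance the FP."""
        self.active.pc, self.active.npc = pc, npc
        # Assumed round-robin step; the FP wraps modulo the frame count.
        self.fp = (self.fp + 1) % NUM_FRAMES
        return self.active


cpu = Processor()
cpu.context_switch(0x100, 0x104)
```

No registers are copied during the switch; each loaded thread keeps its state in its own frame, so the cost is dominated by draining the pipeline.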

Threads in ALEWIFE are virtual. Only a small sub- 
set of all threads can be physically resident on the pro- 
cessors; these threads are called loaded threads. The re- 
maining threads are referred to as unloaded threads and 
live on various queues in memory, waiting their turn 
to be loaded. In a sense, the set of task frames acts 
like a cache on the virtual threads. This organization 
is illustrated in Figure 2. The scheduler tries to choose 
threads from the set of loaded threads for execution to 
minimize the overhead of saving and restoring threads 
to and from memory. When control eventually passes 
back to the thread that suffered a remote request, the 
controller should have completed servicing the request, 
provided the other threads ran for enough cycles. By 
maximizing local cache and memory accesses, the need 
for context switching reduces to once every 50 or 100 
cycles, which allows us to tolerate latencies in the range 
of 150 to 300 cycles with 4 task frames (see Section 8). 
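
The arithmetic behind these figures can be checked with a first-order sketch. It assumes that a remote request is fully hidden when the other loaded threads together run for at least as long as the request takes, and it ignores cache interference and network contention; it is a simplification for illustration, not the performance model of Section 8. The function and parameter names are our own.

```python
def tolerable_latency(run_length, frames, switch_cost=0):
    """Cycles of latency hidden while the other loaded threads execute."""
    return (frames - 1) * (run_length + switch_cost)


def utilization(run_length, switch_cost, latency, frames):
    """First-order processor utilization for coarse-grain multithreading."""
    # Enough threads: only the context-switch overhead is lost.
    saturated = run_length / (run_length + switch_cost)
    # Too few threads: the processor idles for part of each request.
    linear = frames * run_length / (run_length + switch_cost + latency)
    return min(saturated, linear)


low = tolerable_latency(run_length=50, frames=4)    # 3 * 50 cycles
high = tolerable_latency(run_length=100, frames=4)  # 3 * 100 cycles
```

With four task frames and a switch every 50 to 100 cycles, the three other threads cover 150 to 300 cycles, which matches the range quoted above when the switch cost itself is neglected.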

Rapid context switching is used to hide the latency 
encountered in several other trap events, such as syn- 
chronization faults (or attempts to load from "empty" 
locations). These events can either cause the proces- 
sor to suspend execution (wait) or to take a trap. In 
the former case, the controller holds the processor until 
the request is satisfied. This typically happens on lo- 
cal memory cache misses, and on certain full/empty bit 
tests. If a trap is taken, the trap handling routine can 
respond by: 

1. spinning - immediately return from the trap and 
retry the trapping instruction. 

2. switch spinning - context switch without unloading 
the trapped thread. 

3. blocking - unload the thread. 

The above alternatives must be considered with care 
because incorrect choices can create or exacerbate star- 
vation and thrashing problems. An extreme example 
of starvation is this: all loaded threads are spinning 
or switch spinning on an exception condition that an 
unloaded thread is responsible for fulfilling. We are in- 
vestigating several possible mechanisms to handle such 
problems, including a special controller initiated trap 
on certain failed synchronization tests, whose handler 
unloads the thread. 
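
One way a trap handler might choose among the three responses is to escalate with the number of failed attempts. The following Python sketch is purely hypothetical: the thresholds and the escalation rule are assumptions made for illustration and are not the policy used in ALEWIFE.

```python
from enum import Enum


class Response(Enum):
    SPIN = "return from the trap and retry the trapping instruction"
    SWITCH_SPIN = "context switch without unloading the thread"
    BLOCK = "unload the thread"


def choose_response(failed_attempts, spin_limit=2, switch_spin_limit=8):
    """Escalate from spinning to blocking as failures accumulate.

    Both limits are hypothetical tuning parameters.
    """
    if failed_attempts <= spin_limit:
        return Response.SPIN
    if failed_attempts <= switch_spin_limit:
        return Response.SWITCH_SPIN
    return Response.BLOCK
```

Because every thread eventually blocks under such a rule, a task frame is eventually freed for an unloaded thread, which addresses the starvation example given above at the cost of occasional unnecessary unloads.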



An important aspect of the ALEWIFE system is its 
combination of caches and multithreading. While this 
combination is advantageous, it also creates a unique 
class of thrashing and starvation problems. For exam- 
ple, forward progress can be halted if a context execut- 
ing on one processor is writing to a location while a con- 
text on another processor is reading from it. These two 
contexts can easily play "cache tag", since writes to a
location force a context switch and invalidation of other
cached copies, while reads force a context switch and 
transform read-write copies into read-only copies. An- 
other problem involves thrashing between an instruction 
and its data; a context will be blocked if it has a load 
instruction mapped to the same cache line as the target
of the load. These and related problems have been
addressed with appropriate hardware interlock mechanisms.


3.2 Support for Futures 

Executing a Mul-T program with futures incurs two 
types of overhead not present in sequential programs. 
First, strict operations must check their operands for 
availability before using them. Second, there is a cost 
associated with creating new threads. 

Detection of Futures Operand checks for futures 
done in software imply wasted cycles on every strict 
operation. Our measurements with Mul-T running on 
an Encore Multimax show that this is expensive. Even 
with clever compiler optimizations, there is close to a 
factor of two loss in performance over a purely sequen- 
tial implementation (see Table 3). Our solution em- 
ploys a tagging scheme with hardware-generated traps 
if an operand to a strict operator is a future. We believe 
that this hardware support is necessary to make futures 
a viable construct for expressing parallelism. From an 
architectural perspective, this mechanism is similar to 
dynamic type checking in Lisp. However, this mecha- 
nism is necessary even in a statically typed language in 
the presence of dynamic futures. 

APRIL uses a simple data type encoding scheme for 
automatically generating a trap when operands to strict 
operators are futures. This implementation (discussed 
in Section 5) obviates the need to explicitly inspect 
in software the operands to every compute instruction. 
This is important because we do not want to hurt the 
efficiency of all compute instructions because of the possibility
that an operand is a future.

Lazy Task Creation Little can be done to reduce the 
cost of task creation if future is taken as a command 
to create a new task. In many programs the possibility
of creating an excessive number of fine-grain tasks exists.
Our solution to this problem is called lazy task creation [17].
With lazy task creation a future expression
does not create a new task, but computes the expression 
as a local procedure call, leaving behind a marker indi- 
cating that a new task could have been created. The 
new task is created only when some processor becomes 
idle and looks for work, stealing the continuation of that 
procedure call. Thus, the user can specify the maximum 
possible parallelism without the overhead of creating a 
large number of tasks. The race conditions are resolved 
using the fine-grain locking provided by the full/empty 
bits. 

3.3 Fine-Grain Synchronization

Besides support for lazy task creation, efficient fine- 
grain synchronization is essential for large-scale parallel 
computing. Both the dataflow and data-parallel models 
of computation rely heavily on the availability of cheap 
fine-grain synchronization. The unnecessary serializa- 
tion imposed by barriers in MIMD implementations of 
data-parallelism can be avoided by allowing fine-grain
word-level synchronization in data structures. The tra- 
ditional test&set based synchronization requires extra 
memory operations and separate data storage for the 
lock and for the associated data. Busy-waiting or blocking
in conventional processors wastes additional processor
cycles.

APRIL adopts the full/empty bit approach used in 
the HEP to reduce both the storage requirements and 
the number of memory accesses. A bit associated with 
each memory word indicates the state of the word: full 
or empty. The load of an empty location or the store 
into a full location can trap the processor causing a 
context switch, which helps hide synchronization delay. 
Traps also obviate the additional software tests of the 
lock in test&set operations. A similar mechanism is
used to implement I-structures in dataflow machines [3];
however, APRIL differs in that it implements such
synchronizations through software trap handlers.
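
The full/empty discipline can be sketched in a few lines of Python. The class and function names are hypothetical, and a Python exception stands in for the processor trap; only the trapping variants of load and store are modeled.

```python
class SyncTrap(Exception):
    """Stands in for the full/empty trap taken by the processor."""


class Word:
    """A memory word with its associated full/empty bit."""

    def __init__(self):
        self.value = 0
        self.full = False  # words start out empty in this sketch


def load(word, reset_to_empty=False):
    if not word.full:
        raise SyncTrap("load of an empty location")
    if reset_to_empty:
        word.full = False
    return word.value


def store(word, value, set_to_full=True):
    if word.full:
        raise SyncTrap("store into a full location")
    word.value = value
    if set_to_full:
        word.full = True


slot = Word()
store(slot, 7)                           # producer fills the word
item = load(slot, reset_to_empty=True)   # consumer empties it again
```

The lock and the data share one word, so a successful access needs no separate test of a lock variable, and an unsuccessful one reaches a handler that can switch to another thread.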

3.4 Multimodel Support Mechanisms 

APRIL is designed primarily for a shared-memory mul- 
tiprocessor with strongly coherent caches. However, 
we are considering several additional mechanisms which 
will permit explicit management of caches and efficient 
use of network bandwidth. These mechanisms present 
different computational models to the programmer. 

Table 1: Basic instruction set summary.

To allow software-enforced cache coherence, we have
loads and stores that bypass the hardware coherence
mechanism, and a flush operation that permits software
writeback and invalidation of cache lines. A loaded
context has a fence counter that is incremented for
each dirty cache line that is flushed and decremented 
for each acknowledgement from memory. This fence 
counter may be examined to determine if all writebacks 
have completed. We are proposing a block-transfer 
mechanism for efficient transfer of large blocks of data. 
Finally, we are considering an interprocessor-interrupt 
mechanism (IPI) which permits preemptive messages 
to be sent to specific processors. IPIs offer reasonable 
alternatives to polling and, in conjunction with block- 
transfers, form a primitive for the message-passing com- 
putational model. 
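
The behavior of the fence counter can be sketched as follows. The Python class and method names are hypothetical; the sketch models only the counting rule stated above, not the controller hardware.

```python
class FenceCounter:
    """Counts flushed dirty lines that memory has not yet acknowledged."""

    def __init__(self):
        self.pending = 0

    def flush_dirty_line(self):
        # Incremented for each dirty cache line that is flushed.
        self.pending += 1

    def acknowledge(self):
        # Decremented for each acknowledgement from memory.
        if self.pending == 0:
            raise RuntimeError("acknowledgement without a pending flush")
        self.pending -= 1

    def writebacks_complete(self):
        return self.pending == 0


fence = FenceCounter()
fence.flush_dirty_line()
fence.flush_dirty_line()
fence.acknowledge()
```

Software that manages coherence explicitly would examine the counter and proceed past the fence only once it has returned to zero.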

Although each of these mechanisms adds complex- 
ity to our cache controller, they are easily implemented 
in the processor through "out-of-band" instructions as 
discussed in Section 5. 



4 Instruction Set 

APRIL has a basic RISC instruction set augmented 
with special memory instructions for full/empty bit op- 
erations, multithreading, and cache support. The at- 
traction of an implementation based on simple SPARC 
processor modifications has resulted in a basic SPARC- 
like design. All registers are addressed relative to a cur- 
rent frame pointer. Compute instructions are 3-address 
register-to-register arithmetic/logic operations. Condi- 
tional branch instructions take an immediate operand 
and may increment the PC by the value of the immedi- 
ate operand depending on the condition codes set by the 
arithmetic/logic operations. Memory instructions move 
data between memory and the registers, and also inter- 
act with the cache and the full/empty bits. The basic 
instruction categories are summarized in Table 1. The 
remainder of this section describes features of APRIL 
instructions used for supporting multiprocessing. 

Data Type Formats APRIL supports tagged pointers
for Mul-T, as in the Berkeley SPUR processor [12],
by encoding the pointer type in the low order bits of a
data word. Associating the type with the pointer has
the advantage of saving an additional memory reference
when accessing type information. Figure 3 lists the different
type encodings. An important purpose of this
type encoding scheme is to support hardware detection
of futures.

Figure 3: Data Type Encodings.

Future Detection and Compute Instructions 

Since a compute instruction is a strict operation, special 
action has to be taken if either of its operands is a fu- 
ture. APRIL generates a trap if a future is encountered 
by a compute instruction. Future pointers are easily 
detected by their non-zero least significant bit. 
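
The check applied by a compute instruction can be sketched as below. Only the rule stated above (a non-zero least significant bit marks a future pointer) is modeled; the remaining type encodings of Figure 3 are not reproduced, and the Python names are hypothetical.

```python
class FutureTrap(Exception):
    """Stands in for the trap raised when an operand is a future."""


def is_future(word):
    """A future pointer has its least significant bit set."""
    return (word & 1) == 1


def compute_add(a, b):
    """A strict compute instruction: trap if either operand is a future."""
    if is_future(a) or is_future(b):
        raise FutureTrap("operand is a future pointer")
    return (a + b) & 0xFFFFFFFF  # registers are 32 bits wide


result = compute_add(8, 4)
```

In hardware the test proceeds in parallel with the operation itself, so the common case in which neither operand is a future costs no extra cycles, unlike an explicit software check.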

Memory Instructions Memory instructions are 
complex because they interact with the full/empty bits 
and the cache controller. On a memory access, two data 
exceptions can occur: the accessed location may not be 
in the cache (a cache miss), and the accessed location 
may be empty on a load or full on a store (a full/empty 
exception). On a cache miss, the cache/directory con- 
troller can trap the processor or make the processor 
wait until the data is available. On full/empty excep- 
tions, the controller can trap the processor, or allow the 
processor to continue execution. Load instructions also 
have the option of setting the full/empty bit of the ac- 
cessed location to empty while store instructions have 
the option of setting the bit to full. These options give 
rise to 8 kinds of loads and 8 kinds of stores. The load 
instructions are listed in Table 2. Store instructions are 
similar except that they trap on full locations instead 
of empty locations. 
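
The count of eight follows from three independent two-way choices, as the short enumeration below shows. The option labels are descriptive only and are not the instruction mnemonics of Table 2.

```python
from itertools import product

ON_CACHE_MISS = ("trap", "wait")           # controller response to a miss
ON_EMPTY_LOCATION = ("trap", "continue")   # response to a full/empty exception
RESET_BIT_TO_EMPTY = (True, False)         # optional side effect of the load

load_flavors = list(product(ON_CACHE_MISS, ON_EMPTY_LOCATION, RESET_BIT_TO_EMPTY))
```

The same enumeration applies to stores, with "full" in place of "empty" and the optional side effect of setting the bit to full.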

A memory instruction also shares responsibility for 
detecting futures in either of its address operands. Like 
compute instructions, memory instructions also trap 
if the least significant bit of either of their address 
operands is non-zero. This introduces the restriction
that objects in memory cannot be allocated at byte 
boundaries. This, however, is not a problem because 
object allocation at word boundaries is favored for other 
reasons [11]. This trap provides support for implicit fu- 
ture touches in operators that dereference pointers, e.g., 
car in LISP.

Table 2: Load Instructions.

Full/Empty Bit Conditional Branch Instructions

Non-trapping memory instructions allow testing of the
full/empty bit by setting a condition bit indicating the
state of the memory word's full/empty bit. APRIL
provides conditional branch instructions, Jfull and
Jempty, that dispatch on this condition bit. This provides
a mechanism to explicitly control the action taken
following a memory instruction that would normally
trap on a full/empty exception.

Frame Pointer Instructions Instructions are provided
for manipulating the register frame pointer (FP).
FP points to the register frame on which the currently
executing thread resides. An INCFP instruction increments
the FP to point to the next task frame while
a DECFP instruction decrements it. The incrementing
and decrementing is done modulo the number of task
frames. RDFP reads the value of the FP into a register
and STFP writes the contents of a register into the FP.

Instructions for Other Mechanisms The special
mechanisms discussed in Section 3.4, such as FLUSH,
are made available through "out-of-band" instructions.
Interprocessor-interrupts, block-transfers, and FENCE
operations are initiated via memory-mapped I/O instructions
(LDIO, STIO).

5 An Implementation of APRIL

An ALEWIFE node consists of several interacting subsystems:
processor, floating-point unit, cache, memory,
cache and directory controller, and network controller.
For the first round implementation of the ALEWIFE
system, we plan to use a modified SPARC processor
and an unmodified SPARC floating-point unit.¹ There
are several reasons for this choice. First, we have chosen
to devote our limited resources to the design of a custom
ALEWIFE cache and directory controller, rather than
to processor design. Second, the register windows in
the SPARC processor permit a simple implementation
of coarse-grain multithreading. Third, most of the instructions
envisioned for the original APRIL processor
map directly to single or double instruction sequences
on the SPARC. Software compatibility with a commercial
processor allows easy access to a large body of software.
Furthermore, use of a standard processor permits
us to ride the technology curve; we can take advantage
of new technology as it is developed.

¹ The SPARC-based implementation effort is in collaboration
with LSI Logic Corporation.



Rapid Context Switching on SPARC SPARC
processors contain an implementation-dependent number
of overlapping register windows for speeding up procedure
calls. The current register window is altered
via SPARC instructions (SAVE and RESTORE) that mod- 
ify the Current Window Pointer (CWP). Traps incre- 
ment the CWP, while the trap return instruction (RETT) 
decrements it. SPARC's register windows are suited for
rapid context switching and rapid trap handling because 
most of the state of a process (i.e., its 24 local reg- 
isters) can be switched with a single-cycle instruction. 
Although we are not using multiple register windows for 
procedure calls within a single thread, this should not 
significantly hurt performance [25, 24]. 

To implement coarse-grain multithreading, we use 
two register windows per task frame - a user window 
and a trap window. The SPARC processor chosen for 
our implementation has eight register windows, allow- 
ing a maximum of four hardware task frames. Since 
the SPARC does not have multiple program counter 
(PC) chains and processor status registers (PSR), our 
trap code must explicitly save and restore the PSRs 
during context switches (the PC chain is saved by the 
trap itself). These values are saved in the trap window. 
Because the SPARC has a minimum trap overhead of 
five cycles (for squashing the pipeline and computing 
the trap vector), context switches will take at least this 
long. See Section 6.1 for further information. 

The SPARC floating-point unit does not support reg- 
ister windows, but has a single, 32-word register file. 
To retain rapid context switching ability for applica- 
tions that require efficient floating point performance, 
we have divided the floating point register file into four 
sets of eight registers. This is achieved by modifying 
floating-point instructions in a context dependent fash- 
ion as they are loaded into the FPU and by maintaining 
four different sets of condition bits. A modification of 
the SPARC processor will make the CWP available ex- 
ternally to allow insertion into the FPU instruction. 
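
One way to picture the context-dependent rewriting of floating-point register numbers is sketched below. The mapping is an assumption made for illustration: it takes the task frame to be CWP // 2 and places the eight registers of frame i at physical registers 8i through 8i+7. The text does not specify the exact bit-level insertion, and remap_fp_register is a hypothetical name.

```python
FP_REGISTERS = 32                          # single 32-word FPU register file (from the text)
FP_SETS = 4                                # one set per hardware task frame
REGS_PER_SET = FP_REGISTERS // FP_SETS     # eight registers per context


def remap_fp_register(cwp: int, reg: int) -> int:
    """Return the physical FP register for a context-relative register number."""
    if not 0 <= reg < REGS_PER_SET:
        raise ValueError("a thread may only name registers 0..7 in this sketch")
    frame = (cwp // 2) % FP_SETS           # assumed frame numbering
    return frame * REGS_PER_SET + reg
```

With this layout, a thread in any task frame names registers 0 through 7 and never sees the registers of another resident thread, so no floating-point state has to be saved on a context switch.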



Support for Futures. We detect futures on the SPARC via two separate
mechanisms. Future pointers are tagged with their lowest bit set. Thus,
direct use of a future pointer is flagged with a word-alignment
trap. Furthermore, a strict operation, such as subtrac- 
tion, applied to one or more future pointers is flagged 
with a modified non-fixnum trap, that is triggered if an 
operand has its lowest bit set (as opposed to either one 
of the lowest two bits, in the SPARC specification). 
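
The two tag checks can be contrasted in a short sketch. It models tagged words as plain integers and stands in for the hardware trap with a Python exception; FutureTouch, is_future, traps_per_sparc_spec, and strict_subtract are hypothetical names chosen for illustration.

```python
class FutureTouch(Exception):
    """Stands in for the trap raised when a strict operation sees a future."""


def is_future(word: int) -> bool:
    """Future pointers carry a 1 in their lowest bit."""
    return (word & 0b1) == 1


def traps_per_sparc_spec(word: int) -> bool:
    """Unmodified check: trap if either of the two lowest bits is set."""
    return (word & 0b11) != 0


def strict_subtract(a: int, b: int) -> int:
    """Subtract two tagged words, trapping only when an operand is a future."""
    if is_future(a) or is_future(b):
        raise FutureTouch("operand is a future pointer")
    return a - b
```

A word whose low bits are 10 would trap under the unmodified check but passes the modified one, which is the distinction the parenthetical remark above draws.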

Implementation of Loads and Stores. The SPARC definition
includes the Alternate Space Indicator (ASI) feature that
permits a simple implementation of APRIL's many load and
store instructions (described in Section 4). The ASI is
available externally as
an eight-bit field. Normal memory accesses use four of 
the 256 ASI values to indicate user/supervisor and in- 
struction/data accesses. Special SPARC load and store 
instructions (LDASI and STASI) permit use of the other 
252 ASI values. Our first-round implementation uses 
different ASI values to distinguish between flavors of 
load and store instructions, special mechanisms, and 
I/O instructions. 

Interaction with the Cache Controller. The cache
controller in the ALEWIFE system maintains strong 
cache coherence, performs full/empty bit synchroniza- 
tion, and implements special mechanisms. By examin- 
ing the processor's ASI bits during memory accesses, 
it can select between different load/store and synchro- 
nization behavior, and can determine if special mecha- 
nisms should be employed. Through use of the Memory 
Exception (MEXC) line on SPARC, it can invoke syn- 
chronous traps corresponding to cache misses and syn- 
chronization (full/empty) mismatches. The controller 
can suspend processor execution using the MHOLD 
line. It passes condition information to the processor 
through the Coprocessor Condition bits (CCCs), per- 
mitting the full/empty conditional branch instructions 
(Jfull and Jempty) to be implemented as coprocessor 
branch instructions. Asynchronous traps (IPIs) are
delivered via the SPARC's asynchronous trap lines.



6 Compiler and Run-Time System

The compiler and run-time system are integral parts
of the processor design effort. A Mul-T compiler for
APRIL and a run-time system written partly in APRIL
assembly code and partly in T have been implemented.
Constructs for user-directed placement of data and processes
have also been implemented. The run-time system
includes the trap and system routines, Mul-T run-time
support, a scheduler, and a system boot routine.

Since a large portion of the support for multithreading,
synchronization and futures is provided in software
through traps and run-time routines, trap handling
must be fast. Below, we describe the implementation
and performance of the routines used for trap
handling and context switching.

6.1 Cache Miss and Full/Empty Traps

Cache miss traps occur on cache misses that require
a network request and cause the processor to context
switch. Full/empty synchronization exceptions can occur
on certain memory instructions described in Section 4.
The processor can respond to these exceptions
by spinning, switch spinning, or blocking the thread.
In our current implementation, traps handle these exceptions
by switch spinning, which involves a context
switch to the next task frame.

In our SPARC-based design of APRIL, we implement
context switching through the trap mechanism using
instructions that change the CWP. The following is a
trap routine that context switches to the thread in the
next task frame.

    rdpsr psrreg    ; save PSR into a reserved reg.
    save            ; increment the window pointer
    save            ; by 2
    wrpsr psrreg    ; restore PSR for the new context
    jmpl r17        ; return from trap and
    rett r18        ; reexecute trapping instruction

We count 5 cycles for the trap mechanism to allow
the pipeline to empty and save relevant processor state
before passing control to the trap handler. The above
trap handler takes an additional 6 cycles for a total of 11
cycles to effect the context switch. In a custom APRIL
implementation, the cycles lost due to PC saves in the
hardware trap sequence, and those in calling the trap
handler for the PSR saves/restores and double incrementing
the frame pointer could be avoided, allowing a
four-cycle context switch.



6.2 Future Touch Trap 

When a future touch trap is signalled, the future that 
caused the trap will be in a register. The trap han- 
dler has to decode the trapping instruction to find that 
register. The future is resolved if the full/empty bit of 
the future's value slot is set to full. If it is resolved, 
the future in the register is replaced with the resolved 
value; otherwise the trap routine can decide to switch 
spin or block the thread that trapped. Our future touch
trap handler takes 23 cycles to execute if the future is
resolved. 
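
The decision made by the future touch handler can be sketched as follows. This is a minimal model, not the run-time system's code: registers are a dictionary, the trapping register is passed in directly rather than decoded from the instruction, and Future and future_touch_handler are hypothetical names.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class Future:
    full: bool = False      # full/empty bit of the value slot
    value: Any = None       # meaningful only when full is True


def future_touch_handler(registers: Dict[str, Any], reg: str, block: bool = False) -> str:
    """Handle a touch of the future held in registers[reg].

    Returns "resume" when the future was resolved and the register now holds
    its value, otherwise the chosen policy: "block" or "switch-spin".
    """
    future = registers[reg]
    if future.full:
        registers[reg] = future.value   # replace the future with its value
        return "resume"
    return "block" if block else "switch-spin"
```

The resolved path touches only the register and the value slot, which is consistent with it being the short, common case; the unresolved path hands control to the scheduler policy.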

If the trap handler decides to block the thread on an 
unresolved future, the thread must be unloaded from 
the hardware task frame, and an alternate thread may 
be loaded. Loading a thread involves writing the state of 
the thread, including its general registers, its PC chain, 
and its PSR, into a hardware task frame on the pro- 
cessor, and unloading a thread involves saving the state 
of a thread out to memory. Loading and unloading 
threads are expensive operations unless there is special 
hardware support for block movement of data between 
registers and memory. Since the scheduling mechanism 
favors processor-resident threads, loading and unload- 
ing of threads should be infrequent. However, this is an 
issue that is under investigation. 

7 Performance Measurements 

This section presents some results on APRIL's performance
in handling fine-grain tasks. We have implemented
a simulator for the ALEWIFE system written
in C and T. Figure 4 illustrates the organization of the 
simulator. The Mul-T compiler produces APRIL code, 
which gets linked with the run-time system to yield an 
executable program. The instruction-level APRIL pro- 
cessor simulator interprets APRIL instructions. It is 
written in T and simulates 40,000 APRIL instructions 
per second when run on a SPARCServer 330. The pro- 
cessor simulator interacts with the cache and directory 
simulator (written in C) on memory instructions. The 
cache simulator in turn interacts with the network sim- 
ulator (also written in C) when making remote memory 
operations. The simulator has proved to be a useful 
tool in evaluating system-wide architectural tradeoffs
as it provides more accurate results than a trace-driven
simulation. The speed of the simulator has allowed us 
to execute lengthy parallel programs. As an example, in 
a run of speech (described below), the simulated pro- 
gram ran for 100 million simulated cycles before com- 
pleting. 

Evaluation of the ALEWIFE architecture through 
simulations is in progress. A sampling of our results on 
the performance of APRIL running parallel programs 
is presented here. Table 3 lists the execution times of 
four programs written in Mul-T: fib, factor, queens
and speech. fib is the ubiquitous doubly recursive Fibonacci
program with 'future's around each of its recursive
calls. factor finds the largest prime factor of
each number in a range of numbers and sums them up.
queens finds all solutions to the n-queens chess problem
for n = 8. speech is a modified Viterbi graph
search algorithm used in a connected speech recognition
system called SUMMIT, developed by the Spoken Language
Systems Group at MIT. We ran each program
on the Encore Multimax, on APRIL using normal task
creation, and on APRIL using lazy task creation. For
purposes of comparison, execution time has been normalized
to the time taken to execute a sequential version
of each program, i.e., with no futures and compiled with
an optimizing T compiler.

Figure 4: Simulator Organization.

The difference between running the same sequential 
code on T and on Mul-T on the Encore Multimax 
(columns "T seq" and "Mul-T seq") is due to the over- 
head of future detection. Since the Encore does not 
support hardware detection of futures, an overhead of a 
factor of 2 is introduced, even though no futures are ac- 
tually created. There is no overhead on APRIL, which 
demonstrates the advantage of tag support for futures. 

The difference between running sequential code on 
Mul-T and running parallel code on Mul-T with one 
processor ("Mul-T seq" and 1) is due to the overhead 
of thread creation and synchronization in a parallel program.
This overhead is very large for the fib benchmark
on both the Encore and APRIL using normal task creation
because of very fine-grain thread creation. On the
Encore, this overhead accounts for approximately a factor
of 28 in execution time. For APRIL with normal futures, this
overhead accounts for a factor of 14. Lazy task creation
on APRIL creates threads only when the machine
has the resources to execute them, and performs much
better because it has the effect of dynamically partitioning
the program into coarser-grain threads and creating
fewer futures. The overhead introduced is only a factor
of 1.5. In all of the programs, APRIL consistently
demonstrates lower overhead than the Encore, owing to its
support for thread creation and synchronization.

                      T      Mul-T
Program   System      seq    seq    1      2      4      8      16
fib       Encore      1.0    1.8    28.9   16.3   9.2    5.1
          APRIL       1.0    1.0    14.2   7.1    3.6    1.8    0.97
          Apr-lazy    1.0    1.0    1.5    0.78   0.44   0.29   0.19
factor    Encore      1.0    1.4    1.9    0.96   0.50   0.26
          APRIL       1.0    1.0    1.8    0.90   0.45   0.23   0.12
          Apr-lazy    1.0    1.0    1.0    0.52   0.26   0.14   0.09
queens    Encore      1.0    1.8    2.1    1.0    0.54   0.31
          APRIL       1.0    1.0    1.4    0.67   0.33   0.18   0.10
          Apr-lazy    1.0    1.0    1.0    0.51   0.26   0.13   0.07
speech    Encore      1.0    2.0    2.3    1.2    0.62   0.36
          APRIL       1.0    1.0    1.2    0.60   0.31   0.17   0.10
          Apr-lazy    1.0    1.0    1.0    0.52   0.27   0.15   0.09

Table 3: Execution time for Mul-T benchmarks. "T seq" is T running
sequential code, "Mul-T seq" is Mul-T running sequential code, 1 to 16
denote number of processors running parallel code.

Measurements for multiple-processor executions on
APRIL (2 to 16 processors) used the processor simulator without
the cache and network simulators, in effect simulating a
shared-memory machine with no memory latency. The
numbers demonstrate that APRIL and its run-time system
allow parallel program performance to scale when
synchronization and task creation overheads are taken
into account but memory latency is ignored. The
effect of communication in large-scale machines depends 
on several factors such as scheduling, which are active 
areas of investigation. 
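
As a rough illustration of the scaling claim, the following sketch computes speedups from the Apr-lazy rows of Table 3. The numbers are copied from the table; the choice to measure speedup against the one-processor parallel run, and the names APR_LAZY and speedup, are assumptions made for this example.

```python
# Normalized execution times for APRIL with lazy task creation (Table 3).
APR_LAZY = {
    "fib":    {1: 1.5, 2: 0.78, 4: 0.44, 8: 0.29, 16: 0.19},
    "factor": {1: 1.0, 2: 0.52, 4: 0.26, 8: 0.14, 16: 0.09},
    "queens": {1: 1.0, 2: 0.51, 4: 0.26, 8: 0.13, 16: 0.07},
    "speech": {1: 1.0, 2: 0.52, 4: 0.27, 8: 0.15, 16: 0.09},
}


def speedup(times, processors):
    """Speedup over the one-processor parallel run."""
    return times[1] / times[processors]


for name, times in APR_LAZY.items():
    print(f"{name:7s} 16-processor speedup: {speedup(times, 16):.1f}")
```

Because the table entries are rounded to two digits, the resulting speedups are only approximate, and they reflect the zero-latency memory assumption stated above.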

8 Scalability of Multithreaded 
Processor Systems 

Multithreading enhances processor efficiency by allow- 
ing execution to proceed on alternate threads while the 
memory requests of other threads are being satisfied. 
However, any new mechanism is useful only if it en- 
hances overall system performance. This section ana- 
lyzes the system performance of multithreaded proces- 
sors. 

A multithreaded processor design must address the 
tradeoff between reduced processor idle time and in- 
creased cache miss rates, network contention, and con- 
text management overhead. The private working sets of 
multiple contexts interfere in the cache. The added
interference misses, coupled with the higher average traffic
generated by a more highly utilized processor, impose greater
bandwidth demands on the interconnection network.
Context management instructions required to switch 
the processor between threads also add to the over- 
head. Furthermore, the application must display suf- 
ficient parallelism to allow multiple thread assignment 
to each processor. 

What is a good performance metric to evaluate mul- 
tithreading? A good measure of system performance is 
system power, which is the product of the number of 
processors and the average processor utilization. Pro- 
vided the computation of processor utilization takes into 
account the deleterious effects of cache, network, and 
context-switching overhead, the processor utilization is 
itself a good measure. 

We have developed a model for multithreaded pro- 
cessor utilization that includes the cache, network, and 
switching overhead effects. A detailed analysis is pre- 
sented in [1]. This section will summarize the model 
and our chief results. Processor utilization U as a func- 
tion of the number of threads resident on a processor 
p is derived as a function of the cache miss rate m(p), 
the network latency T(p), and the context switching 
overhead C: 



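\[
U(p) =
\begin{cases}
\dfrac{p}{1 + T(p)\,m(p)} & \text{for } p < \dfrac{1 + T(p)\,m(p)}{1 + C\,m(p)} \\[2ex]
\dfrac{1}{1 + C\,m(p)} & \text{for } p \ge \dfrac{1 + T(p)\,m(p)}{1 + C\,m(p)}
\end{cases}
\tag{1}
\]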

When the number of threads is small, complete overlapping
of network latency is not possible. Processor
utilization with one thread is 1/(1 + m(1)T(1)). Ideally,
with p threads available to overlap network delays, the
utilization would increase p-fold. In practice, because
the miss rate and network latency increase to m(p) and
T(p), the utilization becomes p/(1 + m(p)T(p)).

When it is possible to completely overlap network 
latency, processor utilization is limited only by the con- 
text switching overhead paid on every miss (assuming 
a context switch happens on a cache miss), and is given 
by 1/(1 + m(p)C). 
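
The two regimes of the utilization expression can be evaluated with a short sketch. The function takes m(p), T(p), and C as plain numbers; the example loop holds the miss rate and latency constant in p, which is an idealization made here for illustration and not the full model, in which both grow with p. The 2% miss rate, 55-cycle latency, and 10-cycle switch cost are the default values quoted elsewhere in the paper.

```python
def utilization(p: int, miss_rate: float, latency: float, switch_cost: float) -> float:
    """Processor utilization U(p) for p resident threads.

    miss_rate is m(p), latency is T(p) in cycles, switch_cost is C in cycles.
    """
    saturation = (1 + latency * miss_rate) / (1 + switch_cost * miss_rate)
    if p < saturation:
        return p / (1 + latency * miss_rate)    # latency only partly overlapped
    return 1 / (1 + switch_cost * miss_rate)    # latency fully overlapped


# Illustrative values: m = 2%, T = 55 cycles, C = 10 cycles, held constant in p.
for p in range(1, 5):
    print(p, round(utilization(p, 0.02, 55, 10), 3))
```

With these constant values the saturation point falls between one and two threads, so the second regime already applies at p = 2; interference effects, which this sketch omits, are what push the required thread count higher in the full model.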

The models for the cache and network terms have 
been validated through simulations. Both these terms 
are shown to be the sum of two components: one com- 
ponent independent of the number of threads p and the 
other linearly related to p (to first order). Multithread- 
ing is shown to be useful when p is small enough that 
the fixed components dominate. 

Let us look at some results for the default set of sys- 
tem parameters given in Table 4. The analysis assumes 
8000 processors arranged in a three-dimensional array.
In such a system, the average number of hops between a
random pair of nodes is nk/3 = 20, where n denotes network
dimension and k its radix. This yields an average
round-trip network latency of 55 cycles for an unloaded
network, when memory latency and average packet size 
are taken into account. The fixed miss rate comprises 
first-time fetches of blocks into the cache, and the inter- 
ference due to multiprocessor coherence invalidations. 



Parameter                  Value
Memory latency             10 cycles
Network dimension n        3
Network radix k            20
Fixed miss rate            2%
Average packet size        4
Cache block size           16 bytes
Thread working set size    250 blocks
Cache size                 64 Kbytes



Table 4: Default system parameters. 
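
A few of the quantities implied by Table 4 can be checked with simple arithmetic. The sketch below uses only values from the table and the nk/3 expression from the text; the variable names are chosen for this example.

```python
n, k = 3, 20                      # network dimension and radix (Table 4)
processors = k ** n               # nodes in a k-ary n-dimensional array
average_hops = n * k / 3          # average distance quoted in the text

block_bytes = 16                  # cache block size (Table 4)
working_set_blocks = 250          # thread working set size (Table 4)
working_set_bytes = working_set_blocks * block_bytes

print(processors, average_hops, working_set_bytes)
```

The processor count matches the 8000 nodes assumed by the analysis, and each thread's working set comes to 4000 bytes against a 64-Kbyte cache.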

Figure 5 displays processor utilization as a function of 
the number of threads resident on the processor when 
context switching overhead is 10 cycles. The degree 
to which the cache, network, and overhead components 
impact overall processor utilization is also shown. The 
ideal curve shows the increase in processor utilization 
when both the cache miss rate and network contention 
correspond to that of a single process, and do not in- 
crease with the degree of multithreading p. 

[Figure 5 plot: processor utilization (vertical axis, up to 1.0) versus
number of processes p (horizontal axis), with regions labeled Ideal,
Network Effects, Cache and Network Effects, CS Overhead, and Useful Work.]

Figure 5: Relative sizes of the cache, network and overhead
components that affect processor utilization.

We see that as few as three processes yield close to
80% utilization for a ten-cycle context-switch overhead,
which corresponds to our initial SPARC-based implementation
of APRIL. This result is similar to that reported
by Weber and Gupta [26] for coarse-grain multithreaded
processors. The main reason a low degree of
multithreading is sufficient is that context switches are
forced only on cache misses, which are expected to happen
infrequently. The marginal benefit of additional
processes is seen to decrease due to network and cache
interference.

Why is utilization limited to a maximum of about 
0.80 despite an ample supply of threads? The reason is 
that available network bandwidth limits the maximum 
rate at which computation can proceed. When avail- 
able network bandwidth is used up, adding more pro- 
cesses will not improve processor utilization. On the 
contrary, more processes will degrade performance due 
to increased cache interference. In such a situation, 
for better system performance, effort is best spent in 
increasing the network bandwidth, or in reducing the 
bandwidth requirement of each thread. 

The relatively large ten-cycle context switch overhead 
does not significantly impact performance for the de- 
fault set of parameters because utilization depends on 
the product of context switching frequency and switch- 
ing overhead, and the switching frequency is expected 



11 



to be small in a cache-based system. This observation 
is important because it allows a simpler processor im- 
plementation, and is exploited in the design of APRIL. 
A multithreaded processor requires larger caches to 
sustain the working sets of multiple processes, although 
cache interference is mitigated if the processes share 
code and data. For the default parameter set, we found 
that caches greater than 64 Kbytes comfortably sus- 
tain the working sets of four processes. Smaller caches 
suffer more interference and reduce the benefits of mul- 
tithreading. 

9 Conclusions 

We described the architecture of APRIL - a coarse- 
grain multithreaded processor to be used in a cache- 
coherent multiprocessor called ALEWIFE. By rapidly 
switching to an alternate task, APRIL can hide com- 
munication and synchronization delays and achieve high 
processor utilization. The processor makes effective use 
of available network bandwidth because it is rarely idle. 
APRIL provides support for fine-grain tasking and de- 
tection of futures. It achieves high single-thread perfor- 
mance by executing instructions from a given task until 
an exception condition like a synchronization fault or re- 
mote memory operation occurs. Coherent caches reduce 
the context switch rate to approximately once every 50- 
100 cycles. Therefore context switch overheads in the 
4-10 cycle range are tolerable, significantly simplifying 
processor design. By providing hardware support only 
for performance-critical operations and migrating other 
functionality into the compiler and run-time system, we 
were able to simplify the processor design even further. 
We described a SPARC-based implementation of 
APRIL that uses the register windows of SPARC as 
task frames for multiple threads. A processor simulator 
and an APRIL compiler and run-time system have been 
written. The SPARC-based implementation of APRIL 
switches contexts in 11 cycles. APRIL and its asso- 
ciated run-time system practically eliminate the over- 
head of fine-grain task creation and detection of fu- 
tures. For Mul-T, the overhead reduces from 100% 
on an Encore Multimax-based implementation to under 
5% on APRIL. We evaluated the scalability of multithreaded
processors in large-scale parallel machines using
an analytical model. For typical system parameters
and a 10-cycle context-switch overhead, the processor
can achieve close to 80% utilization with three
processor-resident threads.